What Is the Optimal Bin Size of a Histogram: An Informal Description
Abstract
A natural way to estimate the probability density function of an unknown distribution from the sample of data points is to use histograms. The accuracy of the estimate depends on the size of the histogram's bins. There exist heuristic rules for selecting the bin size. In this paper, we show that these rules indeed provide the optimal value of the bin size.

1 Formulation of the Problem

Need to estimate pdfs. One of the most frequent ways to describe a probability distribution is by specifying its probability density function (pdf)

$$\rho(x) \stackrel{\text{def}}{=} \frac{dp}{dx} = \lim_{h \to 0} \frac{\mathrm{Prob}(X \in [x, x+h])}{h}.$$

In many practical situations, all we know about a probability distribution is a sample of data points corresponding to this distribution. How can we estimate the pdf based on this sample?

Enter histograms. A natural way to estimate the limit when $h$ tends to 0 is to consider the value of the ratio corresponding to some small $h$:

$$\rho(x) \approx \frac{\mathrm{Prob}(X \in [x, x+h])}{h}.$$

To use this expression, we need to approximate the corresponding probabilities $\mathrm{Prob}(X \in [x, x+h])$. By definition, the probability of an event is the limit of this event's frequency when the number of data points increases. In particular,

$$\mathrm{Prob}(X \in [x, x+h]) = \lim_{n \to \infty} \frac{n([x, x+h])}{n},$$
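The estimate described above can be sketched in Python. This is only a minimal illustration, assuming that n([x, x+h]) denotes the number of sample points falling in [x, x+h] (the excerpt breaks off before defining it). The function name `histogram_density` and the uniform test sample are hypothetical and do not come from the paper.

```python
import random


def histogram_density(sample, x, h):
    """Approximate the pdf at x by (fraction of points in [x, x + h]) / h.

    Assumption: n([x, x + h]) is read as the count of sample points
    that fall in the interval [x, x + h].
    """
    if h <= 0:
        raise ValueError("bin size h must be positive")
    if not sample:
        raise ValueError("sample must be non-empty")
    # Count the points in the bin, then divide by n and by the bin size.
    count = sum(1 for value in sample if x <= value <= x + h)
    return count / (len(sample) * h)


# Hypothetical data: 10,000 draws from the uniform distribution on [0, 1),
# whose true density is 1 everywhere on that interval.
random.seed(0)
sample = [random.random() for _ in range(10_000)]
estimate = histogram_density(sample, x=0.3, h=0.1)
```

With a very small `h` few points fall in the bin and the estimate becomes noisy, while a large `h` averages the density over a wide interval. That trade-off is why the choice of bin size matters.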
Similar resources
A recipe for optimizing a time-histogram
The time-histogram method is a handy tool for capturing the instantaneous rate of spike occurrence. In most of the neurophysiological literature, the bin size that critically determines the goodness of the fit of the time-histogram to the underlying rate has been selected by individual researchers in an unsystematic manner. We propose an objective method for selecting the bin size of a time-his...
Information-Theoretically Optimal Histogram Density Estimation
We regard histogram density estimation as a model selection problem. Our approach is based on the information-theoretic minimum description length (MDL) principle. MDL-based model selection is formalized via the normalized maximum likelihood (NML) distribution, which has several desirable optimality properties. We show how this approach can be applied for learning generic, irregular (variable-wi...
A Method for Selecting the Bin Size of a Time Histogram
The time histogram method is the most basic tool for capturing a time dependent rate of neuronal spikes. Generally in the neurophysiological literature, the bin size that critically determines the goodness of the fit of the time histogram to the underlying spike rate has been subjectively selected by individual researchers. Here, we propose a method for objectively selecting the bin size from t...
Heuristic and exact algorithms for Generalized Bin Covering Problem
In this paper, we study the Generalized Bin Covering problem. For this problem, an exact algorithm is introduced that can find an optimal solution for small-scale instances. To find a near-optimal solution for large-scale instances, a heuristic algorithm is proposed. The efficiency of the heuristic algorithm is assessed by computational experiments.
Maximizing the entropy of histogram bar heights to explore neural activity: a simulation study on auditory and tactile fibers.
Neurophysiologists often use histograms to explore patterns of activity in neural spike trains. The bin size selected to construct a histogram is crucial: bin widths that are too large result in coarse histograms, while bin widths that are too small exaggerate unimportant detail. Peri-stimulus time (PST) histograms of simulated nerve fibers were studied in the current article. This class of histograms gives information ab...